The Spectral Analysis of Distributions (SAD) methodology developed by Kolker et al. (2002) addresses a fundamental limitation of classical Fourier analysis when applied to biological sequence data. Traditional discrete Fourier transforms (DFT) require that the data interval contains an integral number of periods, which constrains period detection to discrete harmonics of the fundamental frequency.
Technical Innovation: SAD overcomes this constraint by fitting cosine waves with continuously adjustable periods to empirical distributions, enabling detection of non-integer periodicities that may reflect genuine biological constraints.
Rationale for Exploration: Protein domains may exhibit preferred sizes, potentially reflecting optimal folding units, functional constraints, or evolutionary modularity. These constraints may manifest as “periodic peaks” in length distributions at multiples of a fundamental domain size.
Simpler Explanation for Students: SAD helps us detect these hidden patterns in protein length data. Think of protein domains like building blocks: if there's a "preferred size" block that evolution favors, we should see proteins that are 1x, 2x, 3x, or 4x this preferred size more often than random lengths.
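The building-block intuition can be simulated directly. The sketch below is purely illustrative: the 125-aa block size and 12-aa per-block spread are hypothetical numbers, not estimates from any of the datasets analyzed here.

```r
# Illustrative simulation of a hypothetical 125-aa "building block"
set.seed(42)
unit <- 125                                  # hypothetical preferred block size
k <- sample(1:4, 20000, replace = TRUE)      # 1x-4x blocks per protein
len <- round(rnorm(20000, mean = k * unit, sd = sqrt(k) * 12))
# Most simulated lengths land near a multiple of the unit,
# unlike lengths drawn uniformly at random
frac_near <- mean(abs(len - round(len / unit) * unit) <= 20)
frac_near > 0.6
```

A length histogram of `len` would show the 1x, 2x, 3x, 4x peaks that SAD is designed to detect.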
This document implements the SAD methodology across four eukaryotic enzyme datasets, maintaining methodological consistency while handling diverse data structures.
Theoretical Foundation: The Kolker et al. OMICS 2002 study addressed potential bias from overrepresented protein families by creating a non-redundant SWISS-PROT sequence dataset. This step prevents large, heavily sequenced families from dominating the length distribution and producing spurious peaks.
Methodological Approach: The following R scripts follow the paper's SAD and statistical mixture modeling protocol: we remove all proteins with identical descriptions, retaining only the longest variant. This conservative approach ensures that each unique protein function is represented only once, providing a more robust test of genuine length periodicity.
# Function to create non-redundant dataset
create_nonredundant_dataset <- function(proteins) {
# Analyze redundancy based on protein names
redundancy_summary <- proteins %>%
group_by(protein_name_std) %>%
summarise(
count = n(),
min_length = min(length_aa, na.rm = TRUE),
max_length = max(length_aa, na.rm = TRUE),
length_variants = n_distinct(length_aa),
organism_count = n_distinct(organism_name, na.rm = TRUE),
.groups = 'drop'
)
# Create non-redundant dataset by keeping longest variant
nonredundant_proteins <- proteins %>%
group_by(protein_name_std) %>%
slice_max(length_aa, with_ties = FALSE) %>%
ungroup()
return(list(
nonredundant = nonredundant_proteins,
redundancy_summary = redundancy_summary
))
}

Mathematical Foundation: The SAD algorithm implements the Kolker et al. OMICS 2002 three-step process to detect periodicities in length distributions:
Equation 1 from Paper: For each period j, calculate the non-oscillating background:
Nonosc_j(i) = (1/j) × Σ[k=-int(j/2) to int(j/2)] Total(i+k)
Technical Purpose: This period-specific smoothing eliminates any oscillatory component at frequency j, leaving only the non-periodic background. The window size equals the test period, ensuring complete removal of that specific periodicity.
Biological Interpretation: This step removes the underlying “baseline” protein length distribution, isolating only the periodic components that might reflect evolutionary domain constraints.
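A quick synthetic check of Equations 1 and 2 (illustrative values, odd test period so the centered window spans exactly j points): averaging over a window whose width equals the test period removes a period-j cosine exactly, leaving the flat background.

```r
# Synthetic check: a window of width j wipes out a period-j oscillation
j <- 9
i <- 1:200
total <- 50 + cos(2 * pi * i / j)                         # background 50 + oscillation
nonosc <- stats::filter(total, rep(1, j) / j, sides = 2)  # Eq. 1 smoothing
osc <- total - nonosc                                     # Eq. 2 residual
max(abs(nonosc - 50), na.rm = TRUE)                       # ~0: background recovered
```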
Equation 2 from Paper:
Osc_j(i) = Total(i) - Nonosc_j(i)
Purpose: Extract the residual oscillatory component after background removal.
Equation 4 from Paper: Calculate the amplitude as a Fourier coefficient:
A_j = Σ[Osc_j(i) × cos(2πi/j)] / Σ[cos²(2πi/j)]
Mathematical Insight: This is essentially projecting the oscillatory signal onto a cosine basis function of period j. The amplitude A_j quantifies how much the data oscillates at that specific period.
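This projection can be verified on a synthetic signal: for a pure cosine of period j (illustrative amplitude 2.5), Equation 4 returns the injected amplitude.

```r
# For a pure cosine of period j, Eq. 4 recovers its amplitude
j <- 9; A <- 2.5
i <- 50:600
osc <- A * cos(2 * pi * i / j)
cosv <- cos(2 * pi * i / j)
A_hat <- sum(osc * cosv) / sum(cosv^2)  # least-squares projection onto cosine basis
A_hat                                   # ~2.5
```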
Continuous Spectrum Advantage: Unlike discrete FFT, this approach tests periods at unit increments (j = 2, 3, 4, …, 160), providing a virtually continuous spectrum that can detect non-integer periodicities.
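A back-of-the-envelope comparison makes this concrete. Assuming the 50-600 aa analysis window (551 length bins), a plain DFT can only resolve periods of the form N/k, which leaves large gaps at long periods:

```r
# DFT resolution vs. SAD's integer-period grid on a 551-bin interval
N <- 551
fft_periods <- N / (1:floor(N / 2))               # periods a plain DFT resolves
sum(fft_periods >= 100 & fft_periods <= 160)      # only 2 resolvable periods here
length(100:160)                                   # SAD tests all 61 integer periods
```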
Simple Explanation: Imagine looking for a repeating beat in music. Step 1 removes the background noise, Step 2 isolates the rhythm, and Step 3 measures how strong the beat is at each possible tempo. We test many different tempos to find the strongest rhythm.
# Function to perform Spectral Analysis of Distributions (SAD)
sad_analysis_kolker <- function(data_vector, min_period = 2, max_period = 160,
min_length = 50, max_length = 600) {
# Create a frequency table of counts by length
imin <- min_length
imax <- max_length
# Create a Total vector where Total[i] is the count of proteins with length i
Total <- numeric(imax - imin + 1)
names(Total) <- imin:imax
for (len in data_vector) {
if (len >= imin && len <= imax) {
Total[as.character(len)] <- Total[as.character(len)] + 1
}
}
# Initialize results vectors
periods <- min_period:max_period
amplitudes <- numeric(length(periods))
# For each period to test
for (p_idx in seq_along(periods)) {
j <- periods[p_idx]
# Define the interval excluding half-periods from both ends
half_j <- floor(j/2)
interval_start <- imin + half_j
interval_end <- imax - half_j
# Calculate the number of complete periods in the interval
m <- floor((interval_end - interval_start) / j) - 1
if (m < 1) {
amplitudes[p_idx] <- 0
next
}
# 1. Calculate non-oscillating background using weighted moving average
Nonosc <- numeric(imax - imin + 1)
names(Nonosc) <- imin:imax
for (i in interval_start:interval_end) {
i_str <- as.character(i)
# Sum over window of size j centered at i
window_sum <- 0
for (k in -half_j:half_j) {
idx <- as.character(i + k)
if (idx %in% names(Total)) {
window_sum <- window_sum + Total[idx]
}
}
# Calculate the non-oscillating part
Nonosc[i_str] <- window_sum / j
}
# 2. Calculate oscillating component
Osc <- numeric(imax - imin + 1)
names(Osc) <- imin:imax
for (i in interval_start:interval_end) {
i_str <- as.character(i)
Osc[i_str] <- Total[i_str] - Nonosc[i_str]
}
# 3. Apply cosine Fourier transform
valid_indices <- as.character(interval_start:interval_end)
osc_values <- Osc[valid_indices]
lengths <- as.numeric(valid_indices)
cos_values <- cos(2 * pi * lengths / j)
# Calculate amplitude
numerator <- sum(osc_values * cos_values)
denominator <- sum(cos_values^2)
if (denominator > 0) {
amplitudes[p_idx] <- numerator / denominator
} else {
amplitudes[p_idx] <- 0
}
}
return(data.frame(period = periods, amplitude = amplitudes))
}

Theoretical Framework: Beyond period detection, Kolker et al. developed a probabilistic mixture model to test the statistical significance of observed periodicities. This model combines:
Mathematical Formulation: Uses a gamma distribution G(α, β) to model the baseline protein length distribution.
Biological Rationale: Gamma distributions can accommodate various skewness patterns commonly observed in protein length data, providing flexible baseline modeling.
Model Structure: Peaks at multiples μ, 2μ, 3μ, 4μ with increasing variance:
- Peak 1: N(μ, σ²)
- Peak 2: N(2μ, 2σ²)
- Peak 3: N(3μ, 3σ²)
- Peak 4: N(4μ, 4σ²)
Biological Interpretation: This models proteins composed of “i subunits”, each with mean length μ and standard deviation σ. The increasing variance reflects the compounding uncertainty with multiple domains.
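The increasing-variance assumption follows from summing independent domain lengths: the sum of i independent N(μ, σ²) draws is N(iμ, iσ²). A quick simulation with illustrative values (μ = 125, σ = 16; not fitted parameters) confirms this:

```r
# Sum of i independent domains ~ N(i*mu, i*sigma^2): variance compounds
set.seed(1)
mu <- 125; sigma <- 16                     # illustrative values only
domains <- matrix(rnorm(3 * 1e5, mu, sigma), ncol = 3)
len3 <- rowSums(domains)                   # lengths of three-domain proteins
c(mean(len3), sd(len3))                    # ~ (3*mu, sqrt(3)*sigma)
```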
Parameter Set: The model estimates (k+4) parameters: μ, σ, α, β, p₁, p₂, p₃, p₄.
Constraint: Peak probabilities must sum to less than 1: Σpᵢ < 1.
Statistical Test: Compares the unconstrained model L₁ vs. the background-only model L₀.
Test Statistic: λ = -2ln(L₀/L₁) ~ χ²(k+2) under the null hypothesis.
Null Hypothesis: Protein lengths follow a gamma distribution without periodic peaks.
Simple Explanation: We build two models - one that allows for preferred protein sizes (with peaks) and one that doesn’t (background only). If the “peaks” model is significantly better, we have evidence for preferred domain sizes. The likelihood ratio test tells us how confident we can be in this conclusion.
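The mechanics of the likelihood ratio test can be shown with made-up log-likelihoods (illustrative numbers only, not fitted values from these datasets):

```r
# Likelihood ratio test mechanics with illustrative log-likelihoods
ll_full <- -41250                          # log-likelihood of peaks model (L1)
ll_bg   <- -41400                          # log-likelihood of background model (L0)
k <- 4
lambda <- 2 * (ll_full - ll_bg)            # lambda = -2 ln(L0/L1) = 300
p_val <- pchisq(lambda, df = k + 2, lower.tail = FALSE)
p_val < 0.001                              # peaks model strongly preferred
```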
# Normalized gamma PDF for discrete distributions
gamma_pdf_normalized <- function(x, alpha, beta, imin, imax) {
if (alpha <= 0 || beta <= 0) return(rep(1e-10, length(x)))
x_values <- imin:imax
raw_pdf <- dgamma(x_values, shape = alpha + 1, scale = beta)
normalized_pdf <- raw_pdf / sum(raw_pdf)
result <- normalized_pdf[x - imin + 1]
return(result)
}
# Normalized normal PDF for discrete distributions
normal_pdf_normalized <- function(x, mu, sigma, imin, imax) {
if (sigma <= 0) return(rep(1e-10, length(x)))
x_values <- imin:imax
raw_pdf <- dnorm(x_values, mean = mu, sd = sigma)
normalized_pdf <- raw_pdf / sum(raw_pdf)
result <- normalized_pdf[x - imin + 1]
return(result)
}
# Mixture model fitting function with robust error handling
fit_mixture_model_kolker <- function(length_counts, period_hint, k, imin, imax) {
lengths <- as.numeric(names(length_counts))
counts <- as.numeric(length_counts)
if (length(lengths) == 0 || length(counts) == 0) {
warning("Empty length counts data provided")
return(list(
params = rep(NA, 4 + k), mu = NA, sigma = NA, alpha = NA, beta = NA,
p_values = rep(NA, k), mu_background = NA, sigma_background = NA,
mu_pure_background = NA, sigma_pure_background = NA,
convergence = 1, log_likelihood = NA, background_log_likelihood = NA,
lambda = NA, p_value = NA
))
}
# Initial parameter estimation
initial_mu <- ifelse(is.null(period_hint) || is.na(period_hint) || period_hint <= 0, 100, period_hint)
initial_sigma <- max(1, initial_mu / 10)
mean_val <- sum(lengths * counts) / sum(counts)
var_val <- sum(counts * (lengths - mean_val)^2) / sum(counts)
initial_beta <- max(1, var_val / mean_val)
if (is.na(initial_beta) || !is.finite(initial_beta)) initial_beta <- 200
initial_alpha <- max(0.1, (mean_val / initial_beta) - 1)
if (is.na(initial_alpha) || !is.finite(initial_alpha)) initial_alpha <- 1
p_init <- rep(0.05, k)
initial_params <- c(initial_mu, initial_sigma, initial_alpha, initial_beta, p_init)
# Negative log-likelihood function
mixture_nll <- function(params, lengths, counts, k, imin, imax) {
mu <- params[1]
sigma <- params[2]
alpha <- params[3]
beta <- params[4]
p_values <- params[5:(4+k)]
if (mu <= 0 || mu > 160 || sigma <= 0 || sigma > 100 ||
alpha < 0 || alpha > 10 || beta <= 0 || beta > 1000 ||
any(p_values < 0) || any(p_values > 1) || sum(p_values) >= 1) {
return(1e10)
}
tryCatch({
g_pdf <- gamma_pdf_normalized(lengths, alpha, beta, imin, imax)
mixture_pdf <- (1 - sum(p_values)) * g_pdf
for (i in 1:k) {
peak_pdf <- normal_pdf_normalized(lengths, i*mu, sqrt(i)*sigma, imin, imax)
mixture_pdf <- mixture_pdf + p_values[i] * peak_pdf
}
mixture_pdf <- pmax(mixture_pdf, 1e-10)
ll <- sum(counts * log(mixture_pdf))
return(-ll)
}, error = function(e) {
return(1e10)
})
}
# Fit full model
fit <- tryCatch({
optim(
par = initial_params,
fn = mixture_nll,
lengths = lengths, counts = counts, k = k, imin = imin, imax = imax,
method = "L-BFGS-B",
lower = c(20, 1, 0.01, 1, rep(0.001, k)),
upper = c(160, 50, 10, 1000, rep(0.2, k)),
control = list(maxit = 1000)
)
}, error = function(e) {
list(par = initial_params, value = 1e10, convergence = 1)
})
# Fit background-only model
bg_nll <- function(params, lengths, counts, imin, imax) {
alpha <- params[1]
beta <- params[2]
if (alpha < 0 || beta <= 0) return(1e10)
tryCatch({
g_pdf <- gamma_pdf_normalized(lengths, alpha, beta, imin, imax)
ll <- sum(counts * log(pmax(g_pdf, 1e-10)))
return(-ll)
}, error = function(e) {
return(1e10)
})
}
bg_fit <- tryCatch({
optim(
par = c(initial_alpha, initial_beta),
fn = bg_nll,
lengths = lengths, counts = counts, imin = imin, imax = imax,
method = "L-BFGS-B",
lower = c(0.01, 1),
upper = c(10, 1000)
)
}, error = function(e) {
list(par = c(initial_alpha, initial_beta), value = 1e10, convergence = 1)
})
# Calculate likelihood ratio test
lambda <- if(fit$value < 1e10 && bg_fit$value < 1e10) {
2 * (bg_fit$value - fit$value)
} else { NA }
df <- k + 2
p_value <- if(!is.na(lambda)) { pchisq(lambda, df = df, lower.tail = FALSE) } else { NA }
# Extract parameters
params <- fit$par
mu <- params[1]; sigma <- params[2]; alpha <- params[3]; beta <- params[4]
p_values <- params[5:(4+k)]
# Calculate derived parameters
mu_background <- if(!is.na(alpha) && !is.na(beta)) beta * (alpha + 1) else NA
sigma_background <- if(!is.na(alpha) && !is.na(beta)) beta * sqrt(alpha + 1) else NA
mu_pure_background <- if(bg_fit$convergence == 0) bg_fit$par[2] * (bg_fit$par[1] + 1) else NA
sigma_pure_background <- if(bg_fit$convergence == 0) bg_fit$par[2] * sqrt(bg_fit$par[1] + 1) else NA
return(list(
params = params, mu = mu, sigma = sigma, alpha = alpha, beta = beta,
p_values = p_values, mu_background = mu_background, sigma_background = sigma_background,
mu_pure_background = mu_pure_background, sigma_pure_background = sigma_pure_background,
convergence = fit$convergence, log_likelihood = -fit$value,
background_log_likelihood = -bg_fit$value, lambda = lambda, p_value = p_value
))
}

## Analysis Configuration:
## ======================
## Datasets to analyze: 4
## Length range: 50 - 600 amino acids
## Period range: 2 - 160 amino acids
## Number of peaks (k): 4
Theoretical Context: The original Kolker et al. OMICS 2002 study used SWISS-PROT Release 39.16 (2000) with specific filtering criteria: eukaryotic sequences, removal of fragments, retention of only enzymes with EC numbers, and creation of non-redundant datasets by protein description.
Implementation Challenge: Modern UniProt datasets are more comprehensive, allowing more sequences to be analyzed. UniProt sequences were obtained from two different sources, and each was additionally preprocessed, giving rise to four datasets. In this analysis we used a standardization function to ensure the correct columns are used across all four datasets, maintaining methodological consistency while preserving the essential data elements (i.e., lengths) required for SAD analysis.
Key Data Requirements:
- Entry identifiers for protein tracking
- Length values (amino acid counts) for distribution analysis
- EC numbers for enzyme classification
- Organism information for taxonomic filtering
- Protein sequences/names for redundancy removal
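The standardization step described above can be sketched as follows. This is a hypothetical illustration (the actual function is defined outside this excerpt): the raw column names are taken from the dataset summaries reported below, and the target names (`protein_name_std`, `length_aa`, `organism_name`) are those used by the analysis functions.

```r
# Hypothetical sketch of the column-standardization step; the real function
# also handles optional columns (EC number, Sequence, SeqID) when present.
standardize_uniprot <- function(raw_df) {
  data.frame(
    entry            = raw_df[["Entry"]],
    protein_name_std = tolower(trimws(raw_df[["Protein names"]])),
    organism_name    = raw_df[["Organism"]],
    length_aa        = as.numeric(raw_df[["Length"]])
  )
}
```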
Pipeline Design: This section integrates the SAD spectral analysis with mixture model fitting to provide comprehensive period detection and statistical validation, following the complete methodology from Kolker et al. OMICS 2002.
Dual Dataset Analysis: Following the paper’s approach, we analyze both complete and non-redundant datasets to assess the robustness of any detected periodicities. Consistent results across both datasets strengthen confidence in biological significance.
Parameter Integration: The preferred period detected by SAD serves as an informed prior for the mixture model’s fundamental period μ, linking the spectral and probabilistic approaches.
# Function to perform complete analysis
analyze_protein_lengths <- function(proteins_df, dataset_name,
min_length = 50, max_length = 600,
min_period = 2, max_period = 160, k_peaks = 4) {
# Create non-redundant dataset
nr_result <- create_nonredundant_dataset(proteins_df)
nonredundant_proteins <- nr_result$nonredundant
# Filter by length range
filtered_proteins <- proteins_df %>%
filter(length_aa >= min_length & length_aa <= max_length)
filtered_nonredundant <- nonredundant_proteins %>%
filter(length_aa >= min_length & length_aa <= max_length)
cat("Analysis Summary for", dataset_name, ":\n")
cat("Total proteins:", nrow(proteins_df), "\n")
cat("Nonredundant proteins:", nrow(nonredundant_proteins), "\n")
cat("Proteins", min_length, "-", max_length, "aa:", nrow(filtered_proteins),
"(", round(nrow(filtered_proteins)/nrow(proteins_df)*100, 1), "%)\n")
cat("Nonredundant proteins", min_length, "-", max_length, "aa:", nrow(filtered_nonredundant),
"(", round(nrow(filtered_nonredundant)/nrow(nonredundant_proteins)*100, 1), "%)\n\n")
# Apply SAD analysis
cat("Running SAD analysis on entire dataset...\n")
sad_results_all <- sad_analysis_kolker(filtered_proteins$length_aa,
min_period, max_period, min_length, max_length)
cat("Running SAD analysis on nonredundant dataset...\n")
sad_results_nonredundant <- sad_analysis_kolker(filtered_nonredundant$length_aa,
min_period, max_period, min_length, max_length)
# Find preferred periods
preferred_period_all <- sad_results_all$period[which.max(sad_results_all$amplitude)]
preferred_period_nonredundant <- sad_results_nonredundant$period[which.max(sad_results_nonredundant$amplitude)]
cat("Preferred period (entire dataset):", preferred_period_all, "aa\n")
cat("Preferred period (nonredundant dataset):", preferred_period_nonredundant, "aa\n\n")
# Prepare length counts for mixture model
length_counts_all <- table(filtered_proteins$length_aa)
length_counts_nonredundant <- table(filtered_nonredundant$length_aa)
# Fit mixture models
cat("Fitting mixture model to entire dataset...\n")
model_results_all <- fit_mixture_model_kolker(length_counts_all, preferred_period_all,
k_peaks, min_length, max_length)
cat("Fitting mixture model to nonredundant dataset...\n")
model_results_nonredundant <- fit_mixture_model_kolker(length_counts_nonredundant,
preferred_period_nonredundant,
k_peaks, min_length, max_length)
return(list(
filtered_proteins = filtered_proteins,
filtered_nonredundant = filtered_nonredundant,
sad_results_all = sad_results_all,
sad_results_nonredundant = sad_results_nonredundant,
model_results_all = model_results_all,
model_results_nonredundant = model_results_nonredundant,
length_counts_all = length_counts_all,
length_counts_nonredundant = length_counts_nonredundant,
preferred_period_all = preferred_period_all,
preferred_period_nonredundant = preferred_period_nonredundant
))
}

Figure Replication Strategy: Our visualization functions strive to recreate the key figures from the original OMICS article (Figures 1, 4, and 3) to enable direct comparison with published results and validate our implementation.
Technical Details: Shows raw histogram with 41-amino acid smoothing window, matching the paper’s visualization approach. The smoothing reveals underlying trends while preserving the periodic structure.
Interpretation Guide: Peak amplitude indicates periodicity strength. The highest peak identifies the fundamental period, with the x-axis value showing the preferred domain size in amino acids.
Model Comparison: Overlays observed data with fitted mixture model and background-only model. Vertical lines mark the fundamental period and its multiples (1×, 2×, 3×, 4×).
Visual Validation: Good model fit appears as close agreement between the blue fitted line and black data bars, with clear peaks at the predicted periodic positions.
# Function to plot length distribution
plot_length_distribution <- function(data_vector, dataset_name, min_length = 50, max_length = 600) {
filtered_data <- data_vector[data_vector >= min_length & data_vector <= max_length]
hist_data <- hist(filtered_data, breaks = seq(min_length, max_length, by = 1), plot = FALSE)
window_size <- 41
half_window <- floor(window_size / 2)
plot_data <- data.frame(
length = min_length:max_length,
count = numeric(max_length - min_length + 1)
)
for (i in 1:length(hist_data$counts)) {
if (min_length + i - 1 <= max_length) {
plot_data$count[i] <- hist_data$counts[i]
}
}
smoothed_counts <- numeric(nrow(plot_data))
for (i in 1:nrow(plot_data)) {
start_idx <- max(1, i - half_window)
end_idx <- min(nrow(plot_data), i + half_window)
smoothed_counts[i] <- mean(plot_data$count[start_idx:end_idx])
}
par(mar = c(5, 5, 4, 2) + 0.1, cex.axis = 1.2, cex.lab = 1.3, cex.main = 1.4)
plot(plot_data$length, plot_data$count, type = 'h',
main = paste("Distribution of Enzyme Lengths:", dataset_name, "\n(Non-Redundant Dataset)"),
xlab = "Protein Length", ylab = "Number of Proteins",
xlim = c(min_length, max_length), ylim = c(0, max(plot_data$count) * 1.1),
col = "darkblue", lwd = 1)
lines(plot_data$length, smoothed_counts, col = "red", lwd = 2)
legend("topright",
legend = c("Raw Distribution", "Smoothed (41-aa window)"),
col = c("darkblue", "red"), lty = c(1, 1), lwd = c(1, 2),
bg = "white", cex = 1.2)
# return(list(raw = plot_data, smoothed = smoothed_counts)) # keeps RMD output concise.
}
# Function to plot cosine spectrum
plot_cosine_spectrum <- function(sad_results, dataset_name, max_period = 160) {
plot_data <- sad_results[sad_results$period <= max_period, ]
max_period_val <- plot_data$period[which.max(plot_data$amplitude)]
par(mar = c(5, 5, 4, 2) + 0.1, cex.axis = 1.2, cex.lab = 1.3, cex.main = 1.4)
plot(plot_data$period, plot_data$amplitude, type = 'l',
main = paste("Cosine Spectrum:", dataset_name),
xlab = "Period (amino acids)", ylab = "Amplitude",
xlim = c(0, max_period), col = "blue", lwd = 2)
# return(max_period_val) # keeping RMD output concise
}
# Function to create probability density plot
create_density_plot <- function(model_results, length_counts, dataset_name,
min_length = 50, max_length = 600, k = 4) {
mu <- model_results$mu; sigma <- model_results$sigma
alpha <- model_results$alpha; beta <- model_results$beta
p_values <- model_results$p_values; p_value <- model_results$p_value
if (any(is.na(c(mu, sigma, alpha, beta))) || any(is.na(p_values))) {
warning("Invalid model parameters for ", dataset_name)
return(NULL)
}
lengths <- min_length:max_length
tryCatch({
g_pdf <- gamma_pdf_normalized(lengths, alpha, beta, min_length, max_length)
background_only <- g_pdf
full_model <- (1 - sum(p_values)) * g_pdf
for (i in 1:k) {
peak_pdf <- normal_pdf_normalized(lengths, i*mu, sqrt(i)*sigma, min_length, max_length)
full_model <- full_model + p_values[i] * peak_pdf
}
observed <- numeric(length(lengths))
names(observed) <- as.character(lengths)
for (length in names(length_counts)) {
if (as.numeric(length) >= min_length && as.numeric(length) <= max_length) {
observed[length] <- length_counts[length]
}
}
total_obs <- sum(observed)
if (total_obs > 0) observed <- observed / total_obs
par(mar = c(5, 5, 4, 2) + 0.1, cex.axis = 1.2, cex.lab = 1.3, cex.main = 1.4)
plot(lengths, observed, type = 'h',
main = paste("Probability Density:", dataset_name, "\n(Non-Redundant Dataset)"),
xlab = "Protein Length", ylab = "Probability Density",
xlim = c(min_length, max_length),
ylim = c(0, max(c(observed, full_model, background_only), na.rm = TRUE) * 1.1),
col = "black", lwd = 1.2)
lines(lengths, full_model, col = "blue", lwd = 2.5)
lines(lengths, background_only, col = "red", lwd = 2, lty = 2)
if (!is.na(mu)) {
abline(v = c(mu, 2*mu, 3*mu, 4*mu), col = "blue", lty = 3, lwd = 1.5)
text(c(mu, 2*mu, 3*mu, 4*mu), rep(0, 4), c("1×", "2×", "3×", "4×"),
pos = 3, col = "blue", cex = 1.2)
}
legend("topright",
legend = c("Data", "Estimated model", "Background only"),
col = c("black", "blue", "red"), lty = c(1, 1, 2), lwd = c(1, 2.5, 2),
bg = "white", cex = 1.2)
mtext(paste("Period =", round(mu, 2), "aa (p-value =", format(p_value, scientific = TRUE, digits = 2), ")"),
side = 1, line = 3, cex = 1.2)
# return(list(lengths = lengths, observed = observed, full_model = full_model, # keeping RMD output concise
# background_only = background_only))
}, error = function(e) {
warning(paste("Error creating density plot for", dataset_name, ":", e$message))
return(NULL)
})
}

# New dataset obtained by GS 9/7/25: read_tsv("uniprotkb_taxonomy_id_2759_AND_reviewed_unique_2.tsv")
cat("Loading dataset:", datasets[1], "\n")
## Loading dataset: dataset_1.csv
# Read the dataset
proteins_raw_1 <- read_csv(datasets[1], show_col_types = FALSE)
cat("Raw dataset dimensions:", nrow(proteins_raw_1), "x", ncol(proteins_raw_1), "\n")
## Raw dataset dimensions: 14685 x 7
## Dataset: Dataset_1
## Original columns: Entry, Reviewed, Entry Name, Protein names, Gene Names, Organism, Length
## Rows before filtering: 14685
## Rows after filtering: 14685
## Has EC number column: FALSE
## Has Sequence column: FALSE
## Has SeqID column: FALSE
## Standardized dataset dimensions: 14685 x 19
# Run the complete analysis
results_1 <- analyze_protein_lengths(proteins_1, dataset_names[1],
analysis_params$min_length, analysis_params$max_length,
analysis_params$min_period, analysis_params$max_period,
analysis_params$k_peaks)
## Analysis Summary for Dataset_1 :
## Total proteins: 14685
## Nonredundant proteins: 14685
## Proteins 50 - 600 aa: 14685 ( 100 %)
## Nonredundant proteins 50 - 600 aa: 14685 ( 100 %)
##
## Running SAD analysis on entire dataset...
## Running SAD analysis on nonredundant dataset...
## Preferred period (entire dataset): 123 aa
## Preferred period (nonredundant dataset): 123 aa
##
## Fitting mixture model to entire dataset...
## Fitting mixture model to nonredundant dataset...
Table 5 Processed Datasets: Statistical Parameters and p-values (non-redundant)
## ===== STATISTICAL PARAMETERS FOR DATASET_1 =====
## All Proteins Non-redundant
## μ_pure_background 482.5149 482.5149
## σ_pure_background 213.0603 213.0603
## μ_background 450.2763 450.2763
## σ_background 193.0135 193.0135
## μ (fundamental period) 125.2119 125.2119
## σ 15.7515 15.7515
## p1 0.0108 0.0108
## p2 0.0010 0.0010
## p3 0.0498 0.0498
## p4 0.0850 0.0850
## p-value 2.36e-51 2.36e-51
## Loading dataset: dataset_2.csv
## Raw dataset dimensions: 13328 x 7
## Dataset: Dataset_2
## Original columns: Entry, Reviewed, Entry Name, Protein names, Gene Names, Organism, Length
## Rows before filtering: 13328
## Rows after filtering: 13328
## Has EC number column: FALSE
## Has Sequence column: FALSE
## Has SeqID column: FALSE
## Standardized dataset dimensions: 13328 x 19
## Analysis Summary for Dataset_2 :
## Total proteins: 13328
## Nonredundant proteins: 13328
## Proteins 50 - 600 aa: 13328 ( 100 %)
## Nonredundant proteins 50 - 600 aa: 13328 ( 100 %)
##
## Running SAD analysis on entire dataset...
## Running SAD analysis on nonredundant dataset...
## Preferred period (entire dataset): 123 aa
## Preferred period (nonredundant dataset): 123 aa
##
## Fitting mixture model to entire dataset...
## Fitting mixture model to nonredundant dataset...
## ===== STATISTICAL PARAMETERS FOR DATASET_2 =====
## All Proteins Non-redundant
## μ_pure_background 484.6348 484.6348
## σ_pure_background 214.6755 214.6755
## μ_background 420.2115 420.2115
## σ_background 162.8579 162.8579
## μ (fundamental period) 129.0371 129.0371
## σ 17.7137 17.7137
## p1 0.0216 0.0216
## p2 0.0010 0.0010
## p3 0.0278 0.0278
## p4 0.1072 0.1072
## p-value 1.31e-50 1.31e-50
# Independent reviewer 3 dataset originally called: "uniprotkb_taxonomy_id_2759_AND_ec_AND_r_2025_06_27.csv" was renamed dataset_3
cat("Loading dataset:", datasets[3], "\n")
## Loading dataset: dataset_3.csv
# Read the dataset
proteins_raw_3 <- read_csv(datasets[3], show_col_types = FALSE)
cat("Raw dataset dimensions:", nrow(proteins_raw_3), "x", ncol(proteins_raw_3), "\n")
## Raw dataset dimensions: 27600 x 10
## Dataset: Dataset_3
## Original columns: Entry, Reviewed, Entry Name, Protein names, Gene Names, Organism, Length, Sequence, Gene encoded by, EC number
## Rows before filtering: 27600
## Rows after filtering: 27600
## Has EC number column: TRUE
## Has Sequence column: TRUE
## Has SeqID column: FALSE
## Standardized dataset dimensions: 27600 x 22
# Run the complete analysis
results_3 <- analyze_protein_lengths(proteins_3, dataset_names[3],
analysis_params$min_length, analysis_params$max_length,
analysis_params$min_period, analysis_params$max_period,
analysis_params$k_peaks)
## Analysis Summary for Dataset_3 :
## Total proteins: 27600
## Nonredundant proteins: 21813
## Proteins 50 - 600 aa: 18985 ( 68.8 %)
## Nonredundant proteins 50 - 600 aa: 14881 ( 68.2 %)
##
## Running SAD analysis on entire dataset...
## Running SAD analysis on nonredundant dataset...
## Preferred period (entire dataset): 123 aa
## Preferred period (nonredundant dataset): 123 aa
##
## Fitting mixture model to entire dataset...
## Fitting mixture model to nonredundant dataset...
## ===== STATISTICAL PARAMETERS FOR DATASET_3 =====
## All Proteins Non-redundant
## μ_pure_background 498.7061 500.2965
## σ_pure_background 240.2406 234.2714
## μ_background 507.5532 499.7194
## σ_background 265.8698 245.8420
## μ (fundamental period) 123.0974 122.9788
## σ 17.5746 15.7036
## p1 0.0017 0.0036
## p2 0.0067 0.0058
## p3 0.0957 0.0771
## p4 0.0941 0.0821
## p-value 1.54e-75 3.05e-55
Source file: filtered_enzyme_dataset_clusterize80.csv
## Loading dataset: dataset_4.csv
# Read the dataset
proteins_raw_4 <- read_csv(datasets[4], show_col_types = FALSE)
cat("Raw dataset dimensions:", nrow(proteins_raw_4), "x", ncol(proteins_raw_4), "\n")
## Raw dataset dimensions: 20252 x 11
## Dataset: Dataset_4
## Original columns: Entry, Reviewed, Entry Name, Protein names, Gene Names, Organism, Length, Sequence, Gene encoded by, EC number, SeqID
## Rows before filtering: 20252
## Rows after filtering: 20252
## Has EC number column: TRUE
## Has Sequence column: TRUE
## Has SeqID column: TRUE
## Standardized dataset dimensions: 20252 x 23
# Run the complete analysis
results_4 <- analyze_protein_lengths(proteins_4, dataset_names[4],
analysis_params$min_length, analysis_params$max_length,
analysis_params$min_period, analysis_params$max_period,
analysis_params$k_peaks)
## Analysis Summary for Dataset_4 :
## Total proteins: 20252
## Nonredundant proteins: 18063
## Proteins 50 - 600 aa: 13980 ( 69 %)
## Nonredundant proteins 50 - 600 aa: 12378 ( 68.5 %)
##
## Running SAD analysis on entire dataset...
## Running SAD analysis on nonredundant dataset...
## Preferred period (entire dataset): 123 aa
## Preferred period (nonredundant dataset): 123 aa
##
## Fitting mixture model to entire dataset...
## Fitting mixture model to nonredundant dataset...
## ===== STATISTICAL PARAMETERS FOR DATASET_4 =====
## All Proteins Non-redundant
cat(sprintf("%-25s %-20.4f %-20.4f\n", "μ_pure_background",
results_4$model_results_all$mu_pure_background,
results_4$model_results_nonredundant$mu_pure_background))
## μ_pure_background 487.7593 489.0297
cat(sprintf("%-25s %-20.4f %-20.4f\n", "σ_pure_background",
results_4$model_results_all$sigma_pure_background,
results_4$model_results_nonredundant$sigma_pure_background))
## σ_pure_background 222.0246 218.8946
cat(sprintf("%-25s %-20.4f %-20.4f\n", "μ_background",
results_4$model_results_all$mu_background,
results_4$model_results_nonredundant$mu_background))
## μ_background 510.0662 509.4094
cat(sprintf("%-25s %-20.4f %-20.4f\n", "σ_background",
results_4$model_results_all$sigma_background,
results_4$model_results_nonredundant$sigma_background))
## σ_background 252.8820 247.7414
cat(sprintf("%-25s %-20.4f %-20.4f\n", "μ (fundamental period)",
results_4$model_results_all$mu,
results_4$model_results_nonredundant$mu))
## μ (fundamental period) 122.1311 122.4846
cat(sprintf("%-25s %-20.4f %-20.4f\n", "σ",
results_4$model_results_all$sigma,
results_4$model_results_nonredundant$sigma))
## σ 16.7743 17.5077
for (i in 1:analysis_params$k_peaks) {
cat(sprintf("%-25s %-20.4f %-20.4f\n", paste0("p", i),
results_4$model_results_all$p_values[i],
results_4$model_results_nonredundant$p_values[i]))
}
## p1 0.0010 0.0010
## p2 0.0147 0.0173
## p3 0.0860 0.0874
## p4 0.0845 0.0894
cat(sprintf("%-25s %-20s %-20s\n", "p-value",
format(results_4$model_results_all$p_value, scientific = TRUE, digits = 3),
format(results_4$model_results_nonredundant$p_value, scientific = TRUE, digits = 3)))
## p-value 6.18e-41 3.23e-33
# Create comprehensive statistical parameters table (non-redundant datasets;
# all-sequences columns can be built the same way from model_results_all)
extract_nr_params <- function(res) {
  m <- res$model_results_nonredundant
  c(round(m$mu_pure_background, 4),
    round(m$sigma_pure_background, 4),
    round(m$mu_background, 4),
    round(m$sigma_background, 4),
    round(m$mu, 4),
    round(m$sigma, 4),
    round(m$p_values[1:4], 4),
    format(m$p_value, scientific = TRUE, digits = 3))
}
statistical_params_table <- data.frame(
  Parameter = c("μ_pure_background", "σ_pure_background", "μ_background", "σ_background",
                "μ (fundamental period)", "σ", "p1", "p2", "p3", "p4", "p-value"),
  Dataset_1_NR = extract_nr_params(all_results[[1]]),
  Dataset_2_NR = extract_nr_params(all_results[[2]]),
  Dataset_3_NR = extract_nr_params(all_results[[3]]),
  Dataset_4_NR = extract_nr_params(all_results[[4]])
)
# Display the comprehensive statistical parameters table
kable(statistical_params_table,
      col.names = c("Parameter", "NR", "NR", "NR", "NR"),
      caption = "Comprehensive Statistical Parameters: All Four Datasets") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE, font_size = 10) %>%
  add_header_above(c(" " = 1, "Dataset 1" = 1, "Dataset 2" = 1, "Dataset 3" = 1, "Dataset 4" = 1)) %>%
  column_spec(1, bold = TRUE, width = "3cm") %>%
  column_spec(2:5, width = "1.5cm")

| Parameter | NR | NR | NR | NR |
|---|---|---|---|---|
| μ_pure_background | 482.5149 | 484.6348 | 500.2965 | 489.0297 |
| σ_pure_background | 213.0603 | 214.6755 | 234.2714 | 218.8946 |
| μ_background | 450.2763 | 420.2115 | 499.7194 | 509.4094 |
| σ_background | 193.0135 | 162.8579 | 245.842 | 247.7414 |
| μ (fundamental period) | 125.2119 | 129.0371 | 122.9788 | 122.4846 |
| σ | 15.7515 | 17.7137 | 15.7036 | 17.5077 |
| p1 | 0.0108 | 0.0216 | 0.0036 | 0.001 |
| p2 | 0.001 | 0.001 | 0.0058 | 0.0173 |
| p3 | 0.0498 | 0.0278 | 0.0771 | 0.0874 |
| p4 | 0.085 | 0.1072 | 0.0821 | 0.0894 |
| p-value | 2.36e-51 | 1.31e-50 | 3.05e-55 | 3.23e-33 |
# Create ggplot versions of density plots for patchwork
create_ggplot_density <- function(model_results, length_counts, dataset_name,
min_length = 50, max_length = 600, k = 4) {
mu <- model_results$mu; sigma <- model_results$sigma
alpha <- model_results$alpha; beta <- model_results$beta
p_values <- model_results$p_values; p_value <- model_results$p_value
if (any(is.na(c(mu, sigma, alpha, beta))) || any(is.na(p_values))) {
return(ggplot() + ggtitle(paste("Error:", dataset_name)) + theme_void())
}
lengths <- min_length:max_length
tryCatch({
g_pdf <- gamma_pdf_normalized(lengths, alpha, beta, min_length, max_length)
background_only <- g_pdf
full_model <- (1 - sum(p_values)) * g_pdf
for (i in 1:k) {
peak_pdf <- normal_pdf_normalized(lengths, i*mu, sqrt(i)*sigma, min_length, max_length)
full_model <- full_model + p_values[i] * peak_pdf
}
observed <- numeric(length(lengths))
names(observed) <- as.character(lengths)
# Fill in observed counts within the analysis window
for (len in names(length_counts)) {
  if (as.numeric(len) >= min_length && as.numeric(len) <= max_length) {
    observed[len] <- length_counts[len]
  }
}
total_obs <- sum(observed)
if (total_obs > 0) observed <- observed / total_obs
# Create plot data
plot_data <- data.frame(
length = lengths,
observed = observed,
full_model = full_model,
background_only = background_only
)
p <- ggplot(plot_data, aes(x = length)) +
  geom_col(aes(y = observed), color = "black", fill = "lightgray", width = 0.8) +
  geom_line(aes(y = full_model), color = "blue", linewidth = 1.2) +
  geom_line(aes(y = background_only), color = "red", linewidth = 1, linetype = "dashed") +
labs(title = paste("Probability Density:", dataset_name),
subtitle = paste("μ =", round(mu, 2), "aa, p =", format(p_value, scientific = TRUE, digits = 2)),
x = "Protein Length (aa)",
y = "Probability Density") +
xlim(min_length, max_length) +
theme_minimal() +
theme(plot.title = element_text(size = 12, hjust = 0.5),
plot.subtitle = element_text(size = 10, hjust = 0.5),
axis.title = element_text(size = 10),
axis.text = element_text(size = 9))
# Add periodic markers
if (!is.na(mu)) {
p <- p + geom_vline(xintercept = c(mu, 2*mu, 3*mu, 4*mu),
color = "blue", linetype = "dotted", alpha = 0.7)
}
return(p)
}, error = function(e) {
return(ggplot() + ggtitle(paste("Error:", dataset_name)) + theme_void())
})
}
# Create individual density plots
p1_density <- create_ggplot_density(all_results[[1]]$model_results_nonredundant,
all_results[[1]]$length_counts_nonredundant,
dataset_names[1], analysis_params$min_length,
analysis_params$max_length, analysis_params$k_peaks)
p2_density <- create_ggplot_density(all_results[[2]]$model_results_nonredundant,
all_results[[2]]$length_counts_nonredundant,
dataset_names[2], analysis_params$min_length,
analysis_params$max_length, analysis_params$k_peaks)
p3_density <- create_ggplot_density(all_results[[3]]$model_results_nonredundant,
all_results[[3]]$length_counts_nonredundant,
dataset_names[3], analysis_params$min_length,
analysis_params$max_length, analysis_params$k_peaks)
p4_density <- create_ggplot_density(all_results[[4]]$model_results_nonredundant,
all_results[[4]]$length_counts_nonredundant,
dataset_names[4], analysis_params$min_length,
analysis_params$max_length, analysis_params$k_peaks)
# Combine with patchwork
combined_density <- (p1_density | p2_density) / (p3_density | p4_density)
combined_density + plot_annotation(
title = "Comparative Probability Density Models: All Four Datasets (Non-redundant)",
subtitle = "Mixture Model Fitting with Periodic Components vs Background-Only Models",
theme = theme(plot.title = element_text(size = 16, hjust = 0.5),
plot.subtitle = element_text(size = 12, hjust = 0.5))
)

## STATISTICAL MIXTURE MODEL: Extension of OMICS 2002 Framework
## OMICS 2002 introduced probabilistic mixture model combining:
## 1. Smooth background distribution (gamma family for flexibility with skewness)
## 2. Individual peaks at preferred lengths μ and multiples 2μ, 3μ, ..., kμ
## 3. Peak strengths p1, p2, ..., pk and standard deviations σ, √2σ, ..., √kσ
## BIOLOGICAL INTERPRETATION: "protein chain composed of i subunits, each with
## mean length μ and standard deviation σ, statistically independent"
## This reflects domain-based organization of eukaryotic proteins
## MODERN STATISTICAL EXTENSION: Maximum Likelihood with Constrained Optimization
## OMICS 2002 used "method of maximum likelihood" but modern implementation
## uses robust optimization algorithms (L-BFGS-B) with parameter constraints
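In symbols, the mixture the comments above describe (and the code below implements) is

$$
f(x) \;=\; p_0\, g(x;\alpha,\beta) \;+\; \sum_{i=1}^{4} p_i\, \mathcal{N}\!\left(x;\; i\mu,\; \sqrt{i}\,\sigma\right),
\qquad p_0 = 1 - \sum_{i=1}^{4} p_i,
$$

where $g(\cdot;\alpha,\beta)$ is the gamma background and the $i$-th peak models proteins built from $i$ domains, hence mean $i\mu$ and standard deviation $\sqrt{i}\,\sigma$.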
# Log-likelihood for full mixture model (gamma background + 4 normal peaks)
log_likelihood_mixture <- function(params, lengths) {
alpha <- params[1] # gamma shape
beta <- params[2] # gamma scale
mu <- params[3] # base period
sigma <- params[4] # base std dev
p1 <- params[5] # weight for 1x peak
p2 <- params[6] # weight for 2x peak
p3 <- params[7] # weight for 3x peak
p4 <- params[8] # weight for 4x peak
# Constraint: weights must sum to <= 1
if (sum(p1, p2, p3, p4) > 1) return(Inf)
p0 <- 1 - (p1 + p2 + p3 + p4) # background weight
## MIXTURE MODEL COMPONENTS: Following OMICS 2002 Statistical Framework
## Background: Gamma distribution for flexible background shapes
## Peaks: Normal distributions at multiples of base period μ
## Standard deviations scale as √i*σ for i-th multiple (central limit theorem)
# Component PDFs
background_pdf <- dgamma(lengths, shape = alpha, scale = beta)
peak1_pdf <- dnorm(lengths, mean = mu, sd = sigma)
peak2_pdf <- dnorm(lengths, mean = 2*mu, sd = sqrt(2)*sigma)
peak3_pdf <- dnorm(lengths, mean = 3*mu, sd = sqrt(3)*sigma)
peak4_pdf <- dnorm(lengths, mean = 4*mu, sd = sqrt(4)*sigma)
# Mixture PDF
mixture_pdf <- p0 * background_pdf + p1 * peak1_pdf + p2 * peak2_pdf +
p3 * peak3_pdf + p4 * peak4_pdf
# Return negative log-likelihood for minimization
-sum(log(mixture_pdf + 1e-10))
}
# Log-likelihood for background-only model (gamma distribution)
log_likelihood_background <- function(params, lengths) {
alpha <- params[1]
beta <- params[2]
background_pdf <- dgamma(lengths, shape = alpha, scale = beta)
-sum(log(background_pdf + 1e-10))
}
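As a sanity check of the likelihood-ratio machinery before running it on the real datasets, the sketch below applies the same construction to simulated data. All values here are made up, and a reduced one-peak mixture is used for brevity:

```r
## Illustrative only: simulate a gamma background with a normal peak injected
## at 125 aa, then compare background-only vs mixture fits via the LRT.
set.seed(1)
n <- 2000
is_peak <- runif(n) < 0.2                      # 20% of lengths from the peak
sim_lengths <- ifelse(is_peak,
                      rnorm(n, mean = 125, sd = 12),
                      rgamma(n, shape = 4, scale = 100))

# Negative log-likelihoods (same construction as above, k = 1 peak)
nll_mix <- function(par, x) {
  alpha <- par[1]; beta <- par[2]; mu <- par[3]; sigma <- par[4]; p1 <- par[5]
  dens <- (1 - p1) * dgamma(x, shape = alpha, scale = beta) +
          p1 * dnorm(x, mean = mu, sd = sigma)
  -sum(log(dens + 1e-10))
}
nll_bg <- function(par, x) {
  -sum(log(dgamma(x, shape = par[1], scale = par[2]) + 1e-10))
}

fit_mix <- optim(c(3, 80, 120, 10, 0.1), nll_mix, x = sim_lengths,
                 method = "L-BFGS-B",
                 lower = c(0.1, 10, 100, 5, 0), upper = c(20, 500, 150, 30, 0.5))
fit_bg  <- optim(c(3, 80), nll_bg, x = sim_lengths, method = "L-BFGS-B",
                 lower = c(0.1, 10), upper = c(20, 500))

# LRT: twice the log-likelihood gap, df = 3 extra parameters (mu, sigma, p1)
lrt   <- 2 * (fit_bg$value - fit_mix$value)
p_val <- pchisq(lrt, df = 3, lower.tail = FALSE)
cat("LRT statistic:", round(lrt, 2), " p-value:", format(p_val, digits = 3), "\n")
```

With the injected peak the LRT statistic should be large and the p-value effectively zero; rerunning with `is_peak` forced to `FALSE` yields a small LRT, as expected under the null.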
## STATISTICAL TESTING FRAMEWORK: Extensions Beyond OMICS 2002
## Original paper used likelihood ratio test for significance testing
## Modern implementation adds:
## 1. AIC/BIC model comparison (information-theoretic criteria)
## 2. Bootstrap confidence intervals (uncertainty quantification)
## 3. Robust optimization with convergence diagnostics

# Sequential Dataset Processing
# Initialize storage for results
all_results_teamc <- list()
# Loop through each dataset
for (dataset_idx in 1:length(datasets)) {
# Use the existing datasets and dataset_names vectors
filename <- datasets[dataset_idx]
variable_name <- dataset_names[dataset_idx]
description <- paste("Dataset", dataset_idx, "-", variable_name)
cat("\n## Dataset", dataset_idx, ":", variable_name, "\n\n")
# Initialize results storage
dataset_results <- list(
variable_name = variable_name,
filename = filename,
description = description
)
tryCatch({
# Load data - Use the already loaded and standardized data
cat("Using pre-loaded and standardized data for", variable_name, "...\n")
# Get the appropriate filtered dataset from your existing results
analysis_data <- all_results[[dataset_idx]]$filtered_proteins
# Store initial row count
initial_row_count <- nrow(analysis_data)
# The data is already filtered for 50-600 aa in your pipeline, so use it directly
enzyme_lengths <- analysis_data$length_aa
cat("Using", length(enzyme_lengths), "valid sequences (50-600 aa)\n")
# Store basic info including initial count
dataset_results$initial_row_count <- initial_row_count
dataset_results$n_sequences <- length(enzyme_lengths)
dataset_results$mean_length <- mean(enzyme_lengths)
dataset_results$median_length <- median(enzyme_lengths)
dataset_results$sd_length <- sd(enzyme_lengths)
dataset_results$min_length <- min(enzyme_lengths)
dataset_results$max_length <- max(enzyme_lengths)
if (length(enzyme_lengths) > 0) {
## MIXTURE MODEL ANALYSIS: Statistical Framework Implementation
## OMICS 2002 introduced mixture model for rigorous statistical testing
## Modern implementation extends with robust optimization and uncertainty quantification
# MIXTURE MODEL ANALYSIS
cat("Fitting mixture model...\n")
## PARAMETER INITIALIZATION: Based on OMICS 2002 Estimates
## Original study reported: μ ≈ 125 aa, σ ≈ 12 aa
## Gamma parameters estimated from background distribution characteristics
## Mixing weights initialized based on observed peak prominence
# Parameter setup
initial_params <- c(1.5, 200, 120, 12, 0.05, 0.1, 0.08, 0.07)
lower_bounds <- c(0.1, 50, 100, 5, 0, 0, 0, 0)
upper_bounds <- c(10, 500, 150, 20, 0.3, 0.3, 0.3, 0.3)
## OPTIMIZATION STRATEGY: Modern Maximum Likelihood Implementation
## METHODOLOGICAL EXTENSION: L-BFGS-B algorithm with box constraints
## provides more robust parameter estimation than methods available in 2002
# Fit full mixture model
mixture_result <- optim(
par = initial_params,
fn = log_likelihood_mixture,
lengths = enzyme_lengths,
method = "L-BFGS-B",
lower = lower_bounds,
upper = upper_bounds,
control = list(maxit = 1000)
)
# Fit background-only model
background_result <- optim(
par = initial_params[1:2],
fn = log_likelihood_background,
lengths = enzyme_lengths,
method = "L-BFGS-B",
lower = lower_bounds[1:2],
upper = upper_bounds[1:2]
)
# Extract parameters
opt_params <- mixture_result$par
names(opt_params) <- c("alpha", "beta", "mu", "sigma", "p1", "p2", "p3", "p4")
# Calculate derived parameters
alpha <- opt_params["alpha"]
beta <- opt_params["beta"]
mu <- opt_params["mu"]
sigma <- opt_params["sigma"]
p1 <- opt_params["p1"]
p2 <- opt_params["p2"]
p3 <- opt_params["p3"]
p4 <- opt_params["p4"]
p0 <- 1 - (p1 + p2 + p3 + p4)
cat("Full model optimization completed (convergence:", mixture_result$convergence == 0, ")\n")
cat("Background model optimization completed (convergence:", background_result$convergence == 0, ")\n")
## STATISTICAL TESTING: Likelihood Ratio Test (OMICS 2002) + Modern Extensions
## Original methodology: χ² test with (k+2) degrees of freedom
## Modern additions: AIC/BIC for model comparison, robust uncertainty quantification
# Statistical tests: likelihood ratio test, df = 8 - 2 = 6 extra parameters
lrt_statistic <- 2 * (background_result$value - mixture_result$value)
p_value <- pchisq(lrt_statistic, df = 6, lower.tail = FALSE)
n <- length(enzyme_lengths)
n_params_full <- 8
n_params_background <- 2
## MODEL SELECTION CRITERIA: Information-Theoretic Extensions
## AIC/BIC provide alternative to pure significance testing
## These criteria were not available/standard in 2002 bioinformatics
AIC_full <- 2 * mixture_result$value + 2 * n_params_full
AIC_background <- 2 * background_result$value + 2 * n_params_background
BIC_full <- 2 * mixture_result$value + n_params_full * log(n)
BIC_background <- 2 * background_result$value + n_params_background * log(n)
cat("\nStatistical Testing:\n")
cat("LRT statistic:", round(lrt_statistic, 2), "\n")
cat("P-value:", ifelse(p_value < 2.2e-16, "< 2.2e-16", sprintf("%.2e", p_value)), "\n")
cat("AIC improvement:", round(AIC_background - AIC_full, 1), "\n")
cat("BIC improvement:", round(BIC_background - BIC_full, 1), "\n")
## BOOTSTRAP ANALYSIS: Modern Uncertainty Quantification
## MAJOR EXTENSION BEYOND OMICS 2002: Bootstrap confidence intervals
## Provides robust uncertainty estimates for parameter estimates
## Not available in original computational framework (2002 limitations)
# Bootstrap Analysis
cat("\nRunning bootstrap analysis for confidence intervals...\n")
n_bootstrap <- 100
bootstrap_params <- matrix(NA, nrow = n_bootstrap, ncol = 8)
set.seed(42)
for (i in 1:n_bootstrap) {
if (i %% 20 == 0) cat(" Bootstrap iteration", i, "of", n_bootstrap, "\n")
# Bootstrap sample
sample_indices <- sample(length(enzyme_lengths), replace = TRUE)
bootstrap_sample <- enzyme_lengths[sample_indices]
# Fit model to bootstrap sample; tryCatch returns the parameter vector so a
# failed or non-converged fit falls back to the point estimates (an assignment
# inside the error handler would not reach the matrix in the enclosing scope)
bootstrap_params[i, ] <- tryCatch({
  boot_result <- optim(
    par = opt_params,
    fn = log_likelihood_mixture,
    lengths = bootstrap_sample,
    method = "L-BFGS-B",
    lower = lower_bounds,
    upper = upper_bounds,
    control = list(maxit = 200)
  )
  if (boot_result$convergence == 0) boot_result$par else opt_params
}, error = function(e) opt_params)
}
# Calculate confidence intervals
valid_boots <- complete.cases(bootstrap_params)
n_valid_boots <- sum(valid_boots)
if (n_valid_boots >= 10) {
lower_ci <- apply(bootstrap_params[valid_boots, ], 2, quantile, probs = 0.025, na.rm = TRUE)
upper_ci <- apply(bootstrap_params[valid_boots, ], 2, quantile, probs = 0.975, na.rm = TRUE)
} else {
warning("Insufficient successful bootstrap iterations")
lower_ci <- upper_ci <- opt_params
}
cat("Bootstrap completed:", n_valid_boots, "successful iterations\n")
## PARAMETER INTERPRETATION: Biological Significance
## μ (base period): Fundamental domain unit size
## σ (period standard deviation): Variability in domain organization
## p_i (mixing weights): Relative frequency of i-domain proteins
## Bootstrap CIs quantify uncertainty in these biological parameters
# Parameter Estimates Table
param_names <- c("α (gamma shape)", "β (gamma scale)", "μ (base period)", "σ (period std)",
"p₁ (1× weight)", "p₂ (2× weight)", "p₃ (3× weight)", "p₄ (4× weight)")
parameter_estimates <- data.frame(
Parameter = param_names,
Estimate = round(opt_params, 4),
Lower_CI = round(lower_ci, 4),
Upper_CI = round(upper_ci, 4),
CI_Width = round(upper_ci - lower_ci, 4)
)
# keeping RMD concise - there is a summary table for all four datasets at the end.
# print(kable(parameter_estimates,
# caption = paste0("Parameter Estimates with 95% CIs - ", variable_name),
# col.names = c("Parameter", "Estimate", "Lower CI", "Upper CI", "CI Width")) %>%
# kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
# full_width = FALSE) %>%
# row_spec(0, bold = TRUE, background = "#f8f9fa") %>%
# pack_rows("Background Distribution", 1, 2, label_row_css = "background-color: #e3f2fd;") %>%
# pack_rows("Periodic Structure", 3, 4, label_row_css = "background-color: #f3e5f5;") %>%
# pack_rows("Mixing Weights", 5, 8, label_row_css = "background-color: #e8f5e8;"))
## MIXTURE MODEL VISUALIZATION: Enhanced Implementation
## OMICS 2002 Figure 3 inspired visualization with modern enhancements
## Shows empirical data, estimated model components, and full mixture fit
# Individual Mixture Model Visualization
cat("\n### Mixture Model Visualization -", variable_name, "\n\n")
# Prepare data for plotting
x_range <- seq(50, 600, length.out = 551)
# Calculate component PDFs
background_pdf <- dgamma(x_range, shape = alpha, scale = beta)
peak1_pdf <- dnorm(x_range, mean = mu, sd = sigma)
peak2_pdf <- dnorm(x_range, mean = 2*mu, sd = sqrt(2)*sigma)
peak3_pdf <- dnorm(x_range, mean = 3*mu, sd = sqrt(3)*sigma)
peak4_pdf <- dnorm(x_range, mean = 4*mu, sd = sqrt(4)*sigma)
# Create mixture PDF
mixture_pdf <- p0 * background_pdf + p1 * peak1_pdf + p2 * peak2_pdf +
p3 * peak3_pdf + p4 * peak4_pdf
# Create histogram data
hist_data <- hist(enzyme_lengths, breaks = seq(50, 600, by = 10), plot = FALSE)
hist_density <- hist_data$density
hist_centers <- hist_data$mids
# Enhanced visualization
par(mar = c(8, 5, 4, 2), bg = "white")
# Plot histogram
plot(hist_centers, hist_density, type = "h", lwd = 6, col = "lightblue",
xlim = c(50, 600), ylim = c(0, max(hist_density) * 1.1),
main = paste("Mixture Model -", variable_name),
xlab = "Sequence Length (amino acids)", ylab = "Density",
cex.main = 1.3, cex.lab = 1.2, cex.axis = 1.1)
# Add model components
lines(x_range, p0 * background_pdf, col = "red", lwd = 3, lty = 2)
if(p1 > 0.001) lines(x_range, p1 * peak1_pdf, col = "green", lwd = 2)
if(p2 > 0.001) lines(x_range, p2 * peak2_pdf, col = "blue", lwd = 2)
if(p3 > 0.001) lines(x_range, p3 * peak3_pdf, col = "magenta", lwd = 2)
if(p4 > 0.001) lines(x_range, p4 * peak4_pdf, col = "cyan", lwd = 2)
lines(x_range, mixture_pdf, col = "black", lwd = 3)
# Add vertical lines at significant peak positions
if(p1 > 0.001) abline(v = mu, col = "green", lty = 3, lwd = 2)
if(p2 > 0.001) abline(v = 2*mu, col = "blue", lty = 3, lwd = 2)
if(p3 > 0.001) abline(v = 3*mu, col = "magenta", lty = 3, lwd = 2)
if(p4 > 0.001) abline(v = 4*mu, col = "cyan", lty = 3, lwd = 2)
# Legend
legend_items <- c("Data", "Background", "Full Model")
legend_colors <- c("lightblue", "red", "black")
legend_lty <- c(1, 2, 1)
legend_lwd <- c(6, 3, 3)
if(p1 > 0.001) {
legend_items <- c(legend_items, paste0(round(mu), " aa (1×)"))
legend_colors <- c(legend_colors, "green")
legend_lty <- c(legend_lty, 1)
legend_lwd <- c(legend_lwd, 2)
}
if(p2 > 0.001) {
legend_items <- c(legend_items, paste0(round(2*mu), " aa (2×)"))
legend_colors <- c(legend_colors, "blue")
legend_lty <- c(legend_lty, 1)
legend_lwd <- c(legend_lwd, 2)
}
if(p3 > 0.001) {
legend_items <- c(legend_items, paste0(round(3*mu), " aa (3×)"))
legend_colors <- c(legend_colors, "magenta")
legend_lty <- c(legend_lty, 1)
legend_lwd <- c(legend_lwd, 2)
}
if(p4 > 0.001) {
legend_items <- c(legend_items, paste0(round(4*mu), " aa (4×)"))
legend_colors <- c(legend_colors, "cyan")
legend_lty <- c(legend_lty, 1)
legend_lwd <- c(legend_lwd, 2)
}
legend("topright", legend = legend_items, col = legend_colors,
lty = legend_lty, lwd = legend_lwd, cex = 1.0, bg = "white")
# Format p-value
p_value_text <- if(p_value < 2.2e-16) {
"< 2.2e-16"
} else {
format(p_value, scientific = TRUE, digits = 2)
}
# Statistical annotation
mtext(paste0("Base Period: ", round(mu, 1), " ± ", round(sigma, 1), " aa | ",
"LRT p-value: ", p_value_text, " | ",
"Bootstrap CI: ", round(lower_ci[3], 1), " - ", round(upper_ci[3], 1), " aa"),
side = 1, line = 6, cex = 1.0, font = 2)
grid(col = "gray90", lty = "dotted")
# Store mixture model results (expanded)
dataset_results$mixture_model <- list(
converged = mixture_result$convergence == 0,
alpha = as.numeric(alpha),
beta = as.numeric(beta),
mu = as.numeric(mu),
sigma = as.numeric(sigma),
p0 = as.numeric(p0),
p1 = as.numeric(p1),
p2 = as.numeric(p2),
p3 = as.numeric(p3),
p4 = as.numeric(p4),
lrt_statistic = lrt_statistic,
p_value = p_value,
AIC_delta = AIC_background - AIC_full,
BIC_delta = BIC_background - BIC_full,
bootstrap = list(
n_successful = n_valid_boots,
lower_ci = lower_ci,
upper_ci = upper_ci
),
# Store plot data for composite
hist_centers = hist_centers,
hist_density = hist_density,
x_range = x_range,
mixture_pdf = mixture_pdf,
background_pdf = background_pdf,
peak1_pdf = peak1_pdf,
peak2_pdf = peak2_pdf,
peak3_pdf = peak3_pdf,
peak4_pdf = peak4_pdf
)
cat("Analysis completed for", variable_name, "\n")
} else {
dataset_results$error <- "No valid data"
}
}, error = function(e) {
cat("ERROR loading", filename, ":", e$message, "\n")
dataset_results$error <- e$message
})
# Store results
all_results_teamc[[dataset_idx]] <- dataset_results
}
## Using pre-loaded and standardized data for Dataset_1 …
## Using 14685 valid sequences (50-600 aa)
## Fitting mixture model…
## Full model optimization completed (convergence: TRUE )
## Background model optimization completed (convergence: TRUE )
##
## Statistical Testing:
## LRT statistic: 2710.89
## P-value: < 2.2e-16
## AIC improvement: 2698.9
## BIC improvement: 2653.3
##
## Running bootstrap analysis for confidence intervals…
##   Bootstrap iteration 20 of 100
##   Bootstrap iteration 40 of 100
##   Bootstrap iteration 60 of 100
##   Bootstrap iteration 80 of 100
##   Bootstrap iteration 100 of 100
## Bootstrap completed: 97 successful iterations
## Analysis completed for Dataset_1
##
## Using pre-loaded and standardized data for Dataset_2 …
## Using 13328 valid sequences (50-600 aa)
## Fitting mixture model…
## Full model optimization completed (convergence: TRUE )
## Background model optimization completed (convergence: TRUE )
##
## Statistical Testing:
## LRT statistic: 2513.56
## P-value: < 2.2e-16
## AIC improvement: 2501.6
## BIC improvement: 2456.6
##
## Running bootstrap analysis for confidence intervals…
##   Bootstrap iteration 20 of 100
##   Bootstrap iteration 40 of 100
##   Bootstrap iteration 60 of 100
##   Bootstrap iteration 80 of 100
##   Bootstrap iteration 100 of 100
## Bootstrap completed: 100 successful iterations
## Analysis completed for Dataset_2
##
## Using pre-loaded and standardized data for Dataset_3 …
## Using 18985 valid sequences (50-600 aa)
## Fitting mixture model…
## Full model optimization completed (convergence: TRUE )
## Background model optimization completed (convergence: TRUE )
##
## Statistical Testing:
## LRT statistic: 3951.86
## P-value: < 2.2e-16
## AIC improvement: 3939.9
## BIC improvement: 3892.7
##
## Running bootstrap analysis for confidence intervals…
##   Bootstrap iteration 20 of 100
##   Bootstrap iteration 40 of 100
##   Bootstrap iteration 60 of 100
##   Bootstrap iteration 80 of 100
##   Bootstrap iteration 100 of 100
## Bootstrap completed: 100 successful iterations
## Analysis completed for Dataset_3
##
## Using pre-loaded and standardized data for Dataset_4 …
## Using 13980 valid sequences (50-600 aa)
## Fitting mixture model…
## Full model optimization completed (convergence: TRUE )
## Background model optimization completed (convergence: TRUE )
##
## Statistical Testing:
## LRT statistic: 2587.59
## P-value: < 2.2e-16
## AIC improvement: 2575.6
## BIC improvement: 2530.3
##
## Running bootstrap analysis for confidence intervals…
##   Bootstrap iteration 20 of 100
##   Bootstrap iteration 40 of 100
##   Bootstrap iteration 60 of 100
##   Bootstrap iteration 80 of 100
##   Bootstrap iteration 100 of 100
## Bootstrap completed: 100 successful iterations
## Analysis completed for Dataset_4
| Dataset | N | Base Period | 95% CI | LRT χ² | P-value | ΔAIC | ΔBIC | Bootstrap | p₀ | p₁ | p₂ | p₃ | p₄ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Control Group | |||||||||||||
| Dataset_1 | 14685 | 128 ± 20 | 127.6 - 128.4 | 2710.89 | < 2.2e-16 | 2698.9 | 2653.3 | 97/100 (97%) | 52.5% (49.1 - 55.7%) | 0.5% (0 - 0.8%) | 0.5% (0 - 1.4%) | 17.8% (16.5 - 19%) | 28.7% (27.8 - 29.7%) |
| Dataset_2 | 13328 | 127.9 ± 20 | 127.3 - 128.3 | 2513.56 | < 2.2e-16 | 2501.6 | 2456.6 | 100/100 (100%) | 51.9% (48 - 55.2%) | 0.8% (0.4 - 1.1%) | 0.6% (0 - 1.9%) | 17.8% (16.5 - 19.1%) | 28.9% (28 - 29.9%) |
| **Independent Datasets** |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Dataset_3 | 18985 | 127.7 ± 20 | 127.4 - 128.3 | 3951.86 | < 2.2e-16 | 3939.9 | 3892.7 | 100/100 (100%) | 51.5% (49.1 - 53.6%) | 0.3% (0 - 0.7%) | 0% (0 - 0.1%) | 19.4% (18.4 - 20.6%) | 28.8% (28 - 29.5%) |
| Dataset_4 | 13980 | 127.7 ± 20 | 127.1 - 128.4 | 2587.59 | < 2.2e-16 | 2575.6 | 2530.3 | 100/100 (100%) | 51.9% (48.3 - 55.5%) | 0% (0 - 0%) | 0.7% (0 - 1.9%) | 18.4% (16.7 - 19.9%) | 28.9% (27.8 - 29.9%) |
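The $\sqrt{i}\,\sigma$ peak widths assumed throughout follow from treating an $i$-domain protein as a sum of $i$ statistically independent domain lengths. A quick simulation check, with illustrative values only:

```r
## Illustrative only: the total length of i independent Normal(mu, sigma)
## domains has mean i*mu and standard deviation sqrt(i)*sigma.
set.seed(42)
mu <- 125; sigma <- 12; i <- 3
chains <- replicate(10000, sum(rnorm(i, mean = mu, sd = sigma)))
cat("empirical mean:", round(mean(chains), 1), "  expected:", i * mu, "\n")
cat("empirical sd:  ", round(sd(chains), 2), "  expected:", round(sqrt(i) * sigma, 2), "\n")
```

The empirical mean and standard deviation land close to $i\mu = 375$ and $\sqrt{3}\,\sigma \approx 20.8$, consistent with the scaling built into the mixture model.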
# Document the computational environment for reproducibility
cat("Analysis completed on:", format(Sys.time(), "%Y-%m-%d %H:%M:%S %Z"), "\n\n")## Analysis completed on: 2025-09-16 14:43:16 EDT
## R Session Information:
## ===================
## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS 15.3.2
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] patchwork_1.3.1 kableExtra_1.4.0 knitr_1.50 lubridate_1.9.4
## [5] forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.1.0
## [9] readr_2.1.5 tidyr_1.3.1 tibble_3.3.0 ggplot2_3.5.2
## [13] tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.10 generics_0.1.4 xml2_1.3.8 stringi_1.8.7
## [5] hms_1.1.3 digest_0.6.37 magrittr_2.0.3 evaluate_1.0.4
## [9] grid_4.4.1 timechange_0.3.0 RColorBrewer_1.1-3 fastmap_1.2.0
## [13] jsonlite_2.0.0 viridisLite_0.4.2 scales_1.4.0 textshaping_1.0.1
## [17] jquerylib_0.1.4 cli_3.6.5 rlang_1.1.6 crayon_1.5.3
## [21] bit64_4.6.0-1 withr_3.0.2 cachem_1.1.0 yaml_2.3.10
## [25] tools_4.4.1 parallel_4.4.1 tzdb_0.5.0 vctrs_0.6.5
## [29] R6_2.6.1 lifecycle_1.0.4 bit_4.6.0 vroom_1.6.5
## [33] pkgconfig_2.0.3 pillar_1.11.0 bslib_0.9.0 gtable_0.3.6
## [37] glue_1.8.0 systemfonts_1.2.3 xfun_0.52 tidyselect_1.2.1
## [41] rstudioapi_0.17.1 farver_2.1.2 htmltools_0.5.8.1 labeling_0.4.3
## [45] rmarkdown_2.29 svglite_2.2.1 compiler_4.4.1